trace: add barebones ptrace setup #4401

nia-e · 2025-06-15T21:37:42Z

Per the review in #4326, this implements the absolute bare minimum ptrace bits; no access is actually intercepted or logged, but when running in native-lib mode Miri will now fork itself and trace the child process unless this is disabled via CLI flag. Currently the docs are left the same as in the original PR, i.e. they might reference things that aren't implemented yet, but I can change that.

oli-obk

partial review, lunchtime, will review more later

src/alloc/isolated_alloc.rs

oli-obk · 2025-06-16T10:59:41Z

src/bin/miri.rs

+    #[cfg(target_os = "linux")]
+    if !miri_config.native_lib.is_empty() && !miri_config.force_old_native_lib {
+        // FIXME: This should display a diagnostic / warning on error
+        // SAFETY: No other threads have spawned yet


This seems like a footgun. If we register the ctrlc handler before this then suddenly it's UB? Not sure how to improve this, maybe there's a way to ask the process how many threads it has or sth?

Hm, to be fair we can play a little fast and loose here. The docs for fork() claim that after forking in a multithreaded context the child process must only do async-signal-safe things, but that's in a "it might be safe to do other things but it depends on what the other threads are doing" way, not in an "it's always UB to call async-signal-unsafe code" way. As long as we know what the ctrlc handler / other threads are doing, this is still sound to do in a multithreaded context. I'll check the code but given we're talking about an interrupt handler, I find it extremely unlikely that it's an issue

ok, leaving some more info on this safety comment about what you just told me works for me

AFAIK fork suspends/deletes all threads except for the one doing the fork. So why does it matter what those other threads do?

The problem with fork in a concurrent program is that those other threads may hold locks, which will now never be released. That's why async-signal-safety matters -- it's a lot like a signal handler, where the corresponding main thread may hold locks that will never get released while the signal handler runs.

src/shims/trace/child.rs

oli-obk · 2025-06-16T13:20:20Z

src/shims/trace/mod.rs

@RalfJung should we move this module and IsolatedAlloc into a separate crate within this repo that we depend on via a path dependency? Should speed up edit cycles by a small amount and generally improve separation (and thus allow unit testing nicely and potentially moving out of tree at some point). It worked for ui_test 😆

ping @RalfJung

thoughts?

oli-obk

yea this lgtm. Really cool that it's possible

src/shims/trace/parent.rs

src/shims/trace/child.rs

nia-e · 2025-06-16T14:40:26Z

Think this all addresses the review, thanks!

oli-obk

please rebase and squash

RalfJung · 2025-06-18T08:56:47Z

I don't think it's worth a separate crate tbh. But I have also not looked into this PR at all yet.^^

Apply suggestions from code review

Co-authored-by: Oli Scherer <github35764891676564198441@oli-obk.de>

review comments

fix possible hang

nia-e · 2025-06-18T09:34:24Z

In transit currently so I can't build Miri locally to test this but if CI passes this should be fine, ty!

Out of sheer curiosity regarding the code which caused the merge conflict, is there any semantic difference between (*t_list).deref() and just (**t_list)? Can't say I've seen the former pattern before

oli-obk · 2025-06-18T09:43:01Z

Oh that was fallout from the metasized work. This may just be stuff from early drafts we forgot to fix later or it's something about inference. Unsure. But it should almost always be equivalent
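To illustrate the "almost always equivalent" point, here is a small sketch. It assumes a `&Box<[u32]>` as a stand-in for `t_list`, whose real type is not shown in this thread; both function names are hypothetical.

```rust
use std::ops::Deref;

/// Borrows the slice through an explicit `Deref::deref` call.
/// `*t_list` is the `Box`, and `.deref()` auto-borrows it.
pub fn via_deref_call(t_list: &Box<[u32]>) -> &[u32] {
    (*t_list).deref()
}

/// Borrows the same slice with two dereferences and a reborrow.
/// The first `*` strips the reference, the second goes through the `Box`.
pub fn via_double_star(t_list: &Box<[u32]>) -> &[u32] {
    &**t_list
}
```

Both forms end up at the same place in memory. The explicit `.deref()` call names the trait method directly, which can matter when type inference needs a nudge, as suggested above.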

RalfJung

I still have a bunch of questions and comments. Some of them are resolved in #4418, but a bunch are clarification questions, so I'd appreciate a follow-up PR extending the comments here and possibly doing some refactoring.

RalfJung · 2025-06-28T08:07:14Z

src/alloc/isolated_alloc.rs

+    /// Returns a vector of page addresses managed by the allocator.
+    pub fn pages(&self) -> Vec<usize> {
+        let mut pages: Vec<_> =
+            self.page_ptrs.clone().into_iter().map(|p| p.expose_provenance()).collect();


This allocates the buffer for these pages twice... that seems rather unnecessary.
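One way to avoid the extra allocation is to iterate by reference instead of cloning the vector first. The sketch below is only an illustration: `IsolatedAllocSketch` is a hypothetical stand-in, the element type of `page_ptrs` is assumed to be `*mut u8`, and an `as usize` cast is used in place of `expose_provenance()` so the snippet builds on older toolchains.

```rust
/// Hypothetical stand-in for the allocator; only the field used by `pages()`
/// is modelled, and its element type is an assumption.
pub struct IsolatedAllocSketch {
    page_ptrs: Vec<*mut u8>,
}

impl IsolatedAllocSketch {
    pub fn new(page_ptrs: Vec<*mut u8>) -> Self {
        Self { page_ptrs }
    }

    /// Returns the page addresses, allocating only the result vector.
    pub fn pages(&self) -> Vec<usize> {
        // `iter()` borrows the stored pointers, so no temporary copy of
        // `page_ptrs` is created before mapping. The cast exposes the
        // address, like `expose_provenance()` in the PR.
        self.page_ptrs.iter().map(|p| *p as usize).collect()
    }
}
```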

RalfJung · 2025-06-28T08:09:25Z

src/bin/miri.rs

@@ -227,10 +227,11 @@ impl rustc_driver::Callbacks for MiriCompilerCalls {
        } else {
            let return_code = miri::eval_entry(tcx, entry_def_id, entry_type, &config, None)
                .unwrap_or_else(|| {
+                    #[cfg(target_os = "linux")]
+                    miri::register_retcode_sv(rustc_driver::EXIT_FAILURE);


Why is this done here but not all the other places where we exit?

It seems to have something to do with the way abort_if_errors() exits not properly getting us a return code. I'm unsure of what the root cause is - I spent a lot of time trying to debug that but it seems to be the only thing that causes this behaviour, so I special-cased it

abort_if_errors doesn't exit, it unwinds. catch_with_exit_code then turns that into an exit.

I think you should be able to remove the random register_retcode here, and then once #4406 lands add this inside the new fn exit there.

Or does ptrace halt execution on an unwind? It should just resume execution then to let the program run its normal course.
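The unwind-then-exit flow described above can be sketched with the standard library alone. This is not rustc_driver's implementation: the `_sketch` functions, the marker type, and the local constants are hypothetical stand-ins that only mirror the names used in this thread.

```rust
use std::panic::{self, AssertUnwindSafe};

// Local constants mirroring the names used in the thread (assumed values).
pub const EXIT_SUCCESS: i32 = 0;
pub const EXIT_FAILURE: i32 = 1;

/// Hypothetical payload marking a "fatal error" unwind.
struct FatalErrorMarker;

/// Stand-in for `abort_if_errors()`: it unwinds rather than exiting.
pub fn abort_if_errors_sketch(error_count: usize) {
    if error_count > 0 {
        // `resume_unwind` starts unwinding without invoking the panic hook.
        panic::resume_unwind(Box::new(FatalErrorMarker));
    }
}

/// Stand-in for `catch_with_exit_code`: turns an unwind into an exit code.
pub fn catch_with_exit_code_sketch(f: impl FnOnce()) -> i32 {
    match panic::catch_unwind(AssertUnwindSafe(f)) {
        Ok(()) => EXIT_SUCCESS,
        Err(_) => EXIT_FAILURE,
    }
}
```

Under this model there is a single place where the exit code becomes known, which is where a call like `register_retcode_sv` would naturally live.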

RalfJung · 2025-06-28T08:10:14Z

src/bin/miri.rs

+        // thread in an async-signal-unsafe way such as by accessing shared
+        // semaphores, etc.; the handler only calls `sleep()` and `exit()`, which
+        // are async-signal-safe, as is accessing atomics
+        let _ = unsafe { miri::init_sv() };


Why do we just ignore errors here...?

If it errored and the sv process failed to init, then we catch that when calling poll() on it. I would like this to emit a diagnostic on error, though; it just didn't seem necessary to include that as part of this PR

Are you saying we dynamically fall back to calling native code without fine-grained tracing if the setup fails?

That makes sense, but having a let _ then does not. Also, there should be a comment.

> emit a diagnostic on error, though; it just didn't seem necessary to include that as part of this PR

If you have confusing things like let _ instead you need to explain those, or at least add FIXME(ptrace): emit a warning here or so.

EDIT: There actually is a FIXME, fair. I think I'd have rather seen this staged a bit differently but that's getting into very subjective territory. ;)
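For illustration, the dynamic fallback could be made explicit instead of being hidden behind `let _`. The sketch below is hypothetical throughout: `NativeCallMode`, `mode_after_init`, and the `String` error type are assumptions and do not appear in the PR.

```rust
/// How native calls will run after supervisor setup was attempted
/// (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NativeCallMode {
    Traced,
    Untraced,
}

/// Chooses the mode from the setup result and returns the warning text that
/// a caller could surface, rather than silently discarding the error.
pub fn mode_after_init(init_result: Result<(), String>) -> (NativeCallMode, Option<String>) {
    match init_result {
        Ok(()) => (NativeCallMode::Traced, None),
        Err(reason) => {
            // FIXME(ptrace): a real implementation would emit a diagnostic here.
            let warning = format!(
                "ptrace supervisor unavailable ({reason}); native calls will not be traced"
            );
            (NativeCallMode::Untraced, Some(warning))
        }
    }
}
```

Matching on the result keeps the fallback visible at the call site, and the FIXME marks where a proper warning would go.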

RalfJung · 2025-06-28T08:11:11Z

src/shims/trace/child.rs

+/// receiving back events through `get_events`.
+///
+/// # Safety
+/// The invariants for `fork()` must be upheld by the caller.


I don't understand this safety comment. It should at least provide an easily clickable reference to those fork invariants.

I can add that, sure thing.

RalfJung · 2025-06-28T08:11:29Z

src/bin/miri.rs

+        // handler), they will not interact with anything on the main rustc/Miri
+        // thread in an async-signal-unsafe way such as by accessing shared
+        // semaphores, etc.; the handler only calls `sleep()` and `exit()`, which
+        // are async-signal-safe, as is accessing atomics


It sounds like this is relying on ctrlc's internal implementation details, which could change at any time...?

I also have no clue why we are talking about threads here, but that may be because init_sv doesn't have a self-contained safety comment.

I find it somewhat hard to imagine how the ctrlc handler could realistically be changed to act in ways that may break the safety invariants for signal safety, but for clarity I'll add a better safety comment for init_sv.

The main point is that I don't think it even matters what the ctrlc handler does. I think you misunderstood the safety requirements of fork.

RalfJung · 2025-06-28T08:55:51Z

src/shims/trace/parent.rs

+    // reason to use a different one
+    let mut curr_pid = init_pid;
+
+    // There's an initial sigstop we need to deal with


Where does that come from?

init_sv() raises a sigstop on the child arm, to wait in place and make sure that the parent process can ptrace it properly. If it can't, then the child is killed instead of being resumed from sigstop

Seems like we need documentation somewhere for the overarching protocol of who raises which signal when and what happens then.

I put that in trace/messages.rs but I can expand on it if it's insufficient as-is

This code here runs before the main event loop, the comment in messages.rs explains the body of the loop IIUC.

I'll put it in the docs for init_sv then since that's the most obvious place I think

It's part of the protocol, I think it makes sense to have the entire protocol in one place somewhere. So I'd add it to messages.rs. The text there could also make it more clear that those steps there are being repeated arbitrarily often, and which message types are used for which of these messages.

RalfJung · 2025-06-28T08:57:44Z

src/shims/trace/parent.rs

+                confirm_tx.send(Confirmation).unwrap();
+                // We can't trust simply calling `Pid::this()` in the child process to give the right
+                // PID for us, so we get it this way
+                curr_pid = wait_for_signal(None, signal::SIGSTOP, false).unwrap();


Who's ending this SIGSTOP and where? Seems like there's a protocol here that I haven't seen documented so far.

This is the sigstop that the child raises in start_ffi(). Then calling ptrace::syscall() is what ends it (it's the same as ptrace::cont() but also pauses the child when it enters or exits a syscall; right now unused, but will be important for the malloc tracing later)

RalfJung · 2025-06-28T08:58:38Z

src/shims/trace/parent.rs

+            // Child entered a syscall; we wait for exits inside of this, so it
+            // should never trigger on return from a syscall we care about


I don't understand this comment. Who's waiting for what where and why?

Bah, this is wrong. I should have edited this when I gutted the malloc tracing bits, apologies

RalfJung · 2025-06-28T08:59:51Z

tests/native-lib/pass/ptr_read_access.stderr

I am very surprised that this passed CI since we should be running this on macOS as well. Any idea what is happening here?

I cfg(linux)'ed all of the trace-related bits since macOS has a very barebones (bad) ptrace implementation, so when running on it it'll just use the old behaviour

Right, so macos doesn't get the fine-grained ptrace. But macos apparently still prints the error that should only be printed when using fine-grained ptrace. So there's clearly a bug here, and your reply doesn't explain it.

RalfJung · 2025-06-28T09:19:51Z

src/shims/trace/child.rs

shims/trace is a bit ambiguous IMO, let's move this inside a native_lib module.

oli-obk requested changes Jun 16, 2025

oli-obk reviewed Jun 16, 2025

oli-obk reviewed Jun 16, 2025

nia-e force-pushed the barebones-ptrace branch from 17ff063 to a390068 Compare June 16, 2025 14:39

oli-obk approved these changes Jun 18, 2025

minimal ptrace setup

e815550

nia-e force-pushed the barebones-ptrace branch from 5171739 to e815550 Compare June 18, 2025 09:29

oli-obk enabled auto-merge June 18, 2025 09:43

oli-obk added this pull request to the merge queue Jun 18, 2025

Merged via the queue into rust-lang:master with commit a87c442 Jun 18, 2025
8 checks passed

RalfJung mentioned this pull request Jun 28, 2025

various minor native-lib-tracing tweaks #4418

Merged

RalfJung reviewed Jun 28, 2025

RalfJung mentioned this pull request Jul 2, 2025

native_lib/trace: fix and reenable #4435

Merged
